IEICE global.ieice.org Site

Author Search Result

[Author] Wei ZHAO(32hit)

21-32hit(32hit)

Noise Robust Feature Scheme for Automatic Speech Recognition Based on Auditory Perceptual Mechanisms
Shang CAI Yeming XIAO Jielin PAN Qingwei ZHAO Yonghong YAN

PAPER-Speech and Hearing

Vol: E95-D No: 6
Page(s): 1610-1618
Mel Frequency Cepstral Coefficients (MFCC) are the most popular acoustic features used in automatic speech recognition (ASR), mainly because the coefficients capture the most useful information in speech and fit well with the assumptions used in hidden Markov models. As is well known, MFCCs already employ several principles which have known counterparts in the peripheral properties of human hearing: decoupling across frequency, mel-warping of the frequency axis, log-compression of energy, etc. It is natural to introduce more mechanisms from the auditory periphery to improve the noise robustness of MFCC. In this paper, a k-nearest neighbors based frequency masking filter is proposed to reduce the audibility of spectral valleys, which are sensitive to noise. In addition, Moore and Glasberg's critical band equivalent rectangular bandwidth (ERB) expression is utilized to determine the filter bandwidth. Furthermore, a new bandpass infinite impulse response (IIR) filter is proposed to imitate the temporal masking phenomenon of the human auditory system. These three auditory perceptual mechanisms are combined with the standard MFCC algorithm in order to investigate their effects on ASR performance, and a revised MFCC extraction scheme is presented. Recognition performance with the standard MFCC, RASTA perceptual linear prediction (RASTA-PLP) and the proposed feature extraction scheme is evaluated on a medium-vocabulary isolated-word recognition task and a more complex large vocabulary continuous speech recognition (LVCSR) task. Experimental results show that consistent robustness against background noise is achieved on these two tasks, and the proposed method outperforms both the standard MFCC and RASTA-PLP.
Policy Optimization for Spoken Dialog Management Using Genetic Algorithm
Hang REN Qingwei ZHAO Yonghong YAN

PAPER-Spoken dialog system

Publicized: 2016/07/19
Vol: E99-D No: 10
Page(s): 2499-2507
The optimization of spoken dialog management policies is a non-trivial task due to the erroneous inputs from speech recognition and language understanding modules. The dialog manager needs to ground uncertain semantic information at times to fully understand the needs of human users and successfully complete the required dialog tasks. Approaches based on reinforcement learning are currently mainstream in academia and have proven effective, especially when operating in noisy environments. However, in reinforcement learning the dialog strategy is often represented by a complex numeric model and is thus incomprehensible to humans. The trained policies are very difficult for dialog system designers to verify or modify, which largely limits their deployment in commercial applications. In this paper we propose a novel framework for optimizing dialog policies specified in a human-readable domain language using a genetic algorithm. We present learning algorithms using a user simulator and real human-machine dialog corpora. Empirical results show that the proposed approach can achieve competitive performance on par with some state-of-the-art reinforcement learning algorithms, while maintaining a comprehensible policy structure.
Discriminative Pronunciation Modeling Using the MPE Criterion
Meixu SONG Jielin PAN Qingwei ZHAO Yonghong YAN

LETTER-Speech and Hearing

Publicized: 2014/12/02
Vol: E98-D No: 3
Page(s): 717-720
Introducing pronunciation models into decoding has been proven beneficial to LVCSR. In this paper, a discriminative pronunciation modeling method is presented within the framework of Minimum Phone Error (MPE) training for HMM/GMM. In order to bring the pronunciation models into MPE training, the auxiliary function is rewritten at the word level and decomposed into two parts. One is for co-training the acoustic models, and the other is for discriminatively training the pronunciation models. On a Mandarin conversational telephone speech recognition task, compared to the baseline using a canonical lexicon, the discriminative pronunciation models reduced the absolute Character Error Rate (CER) by 0.7% on the LDC test set, and with acoustic model co-training, an additional 0.8% CER decrease was achieved.
A Two-Fold Cross-Validation Training Framework Combined with Meta-Learning for Code-Switching Speech Recognition
Zheying HUANG Ji XU Qingwei ZHAO Pengyuan ZHANG

LETTER-Speech and Hearing

Publicized: 2022/06/20
Vol: E105-D No: 9
Page(s): 1639-1642
Although end-to-end speech recognition research for Mandarin-English code-switching has attracted increasing interest, it remains challenging due to data scarcity. The meta-learning approach is popular for low-resource modeling using high-resource data, but it does not make full use of low-resource code-switching data. Therefore, we propose a two-fold cross-validation training framework combined with the meta-learning approach. Experiments on the SEAME corpus demonstrate the effects of our method.
Fuzzy Matching of Semantic Class in Chinese Spoken Language Understanding
Yanling LI Qingwei ZHAO Yonghong YAN

PAPER-Natural Language Processing

Vol: E96-D No: 8
Page(s): 1845-1852
The semantic concept in an utterance is obtained by a fuzzy matching method to solve problems such as word variations induced by automatic speech recognition (ASR), or key information fields omitted by users, in the process of spoken language understanding (SLU). A two-stage method is proposed: first, we adopt a conditional random field (CRF) for building probabilistic models to segment and label entity names from an input sentence. Second, fuzzy matching based on a similarity function is conducted between the named entities labeled by the CRF model and the reference characters of a dictionary. The experiments compare performance in terms of accuracy and processing speed. Among the four similarity measures, Dice similarity and cosine similarity based on TF score achieve the best accuracy, reaching 93% or higher in F1-measure. In particular, the latter improves by 8.8% and 9% over q-gram and improved edit-distance, respectively, which are two conventional methods for fuzzy string matching.
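The second stage described above can be illustrated with a minimal sketch of the two best-performing measures. It assumes strings are compared through character-bigram counts used as the TF scores; the abstract does not specify the tokenization, and the function names and the toy dictionary are hypothetical, not taken from the paper.

```python
from collections import Counter
import math


def char_bigrams(text):
    """Count overlapping character bigrams (assumed here as the TF units)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice_similarity(a, b):
    """Dice coefficient: 2 * |X ∩ Y| / (|X| + |Y|) over bigram multisets."""
    x, y = char_bigrams(a), char_bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total


def cosine_tf_similarity(a, b):
    """Cosine of the angle between the two bigram term-frequency vectors."""
    x, y = char_bigrams(a), char_bigrams(b)
    dot = sum(count * y[gram] for gram, count in x.items() if gram in y)
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def best_match(entity, dictionary, similarity=dice_similarity):
    """Return the dictionary entry most similar to a labeled entity."""
    return max(dictionary, key=lambda entry: similarity(entity, entry))


# Hypothetical usage: a misrecognized entity is mapped to its dictionary form.
match = best_match("bejing", ["beijing", "shanghai"])
```

In this sketch a misrecognized entity such as "bejing" still shares most of its bigrams with "beijing", so it is mapped to that entry rather than rejected as an exact-match failure.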
A One-Pass Real-Time Decoder Using Memory-Efficient State Network
Jian SHAO Ta LI Qingqing ZHANG Qingwei ZHAO Yonghong YAN

PAPER-ASR System Architecture

Vol: E91-D No: 3
Page(s): 529-537
This paper presents our decoder, which adopts the idea of statically optimizing part of the knowledge sources while handling the others dynamically. The lexicon, phonetic contexts and acoustic model are statically integrated to form a memory-efficient state network, while the language model (LM) is dynamically incorporated on the fly by means of extended tokens. The novelties of our approach for constructing the state network are (1) introducing two layers of dummy nodes to cluster the cross-word (CW) context dependent fan-in and fan-out triphones, (2) introducing a so-called "WI layer" to store the word identities and putting the nodes of this layer in the non-shared mid-part of the network, (3) optimizing the network at state level by a sufficient forward and backward node-merge process. The state network is organized as a multi-layer structure for distinct token propagation at each layer. By exploiting the characteristics of the state network, several techniques including LM look-ahead, LM cache and beam pruning are specially designed for search efficiency. In beam pruning in particular, a layer-dependent pruning method is proposed to further reduce the search space. The layer-dependent pruning takes account of the neck-like characteristics of the WI layer and the reduced variety of word endings, which enables a tighter beam without introducing many search errors. In addition, other techniques including LM compression, lattice-based bookkeeping and lattice garbage collection are also employed to reduce the memory requirements. Experiments are carried out on a Mandarin spontaneous speech recognition task where the decoder involves a trigram LM and CW triphone models. A comparison with HDecode from the HTK toolkit shows that, within 1% performance deviation, our decoder can run 5 times faster with half of the memory footprint.
A Two-Stage Attention Based Modality Fusion Framework for Multi-Modal Speech Emotion Recognition
Dongni HU Chengxin CHEN Pengyuan ZHANG Junfeng LI Yonghong YAN Qingwei ZHAO

LETTER-Human-computer Interaction

Publicized: 2021/04/30
Vol: E104-D No: 8
Page(s): 1391-1394
Recently, automated recognition and analysis of human emotion has attracted increasing attention from multidisciplinary communities. However, it is challenging to utilize the emotional information simultaneously from multiple modalities. Previous studies have explored different fusion methods, but they mainly focused on either inter-modality interaction or intra-modality interaction. In this letter, we propose a novel two-stage fusion strategy named modality attention flow (MAF) to model the intra- and inter-modality interactions simultaneously in a unified end-to-end framework. Experimental results show that the proposed approach outperforms the widely used late fusion methods, and achieves even better performance when the number of stacked MAF blocks increases.
Improved End-to-End Speech Recognition Using Adaptive Per-Dimensional Learning Rate Methods
Xuyang WANG Pengyuan ZHANG Qingwei ZHAO Jielin PAN Yonghong YAN

LETTER-Acoustic modeling

Publicized: 2016/07/19
Vol: E99-D No: 10
Page(s): 2550-2553
The introduction of deep neural networks (DNNs) has led to a significant improvement in automatic speech recognition (ASR) performance. However, the whole ASR system remains complicated due to its dependence on the hidden Markov model (HMM). Recently, a new end-to-end ASR framework, which utilizes recurrent neural networks (RNNs) to directly model context-independent targets with the connectionist temporal classification (CTC) objective function, has been proposed and achieves results comparable to the hybrid HMM/DNN system. In this paper, we investigate per-dimensional learning rate methods, ADAGRAD and ADADELTA included, to improve the recognition of the end-to-end system, based on the fact that the blank symbol used in the CTC technique dominates the output and these methods give frequent features small learning rates. Experimental results show that more than 4% relative reduction of word error rate (WER) as well as 5% absolute improvement of label accuracy on the training set are achieved when using ADADELTA, and fewer epochs of training are needed.
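The property the abstract relies on, that per-dimensional methods give frequently updated features small learning rates, can be seen in a minimal sketch of the standard ADAGRAD update. The toy gradients, the base learning rate and the helper name are assumptions for illustration and do not reproduce the paper's training setup.

```python
import math


def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Apply one ADAGRAD update in place; return the per-dimension step sizes."""
    steps = []
    for i, g in enumerate(grads):
        accum[i] += g * g                        # running sum of squared gradients
        step = lr / (math.sqrt(accum[i]) + eps)  # effective rate for this dimension
        params[i] -= step * g
        steps.append(step)
    return steps


params = [0.0, 0.0]
accum = [0.0, 0.0]
# Dimension 0 receives a gradient at every step (a "frequent" feature, like
# the CTC blank symbol); dimension 1 only at every tenth step (a rare one).
for t in range(100):
    grads = [1.0, 1.0 if t % 10 == 0 else 0.0]
    steps = adagrad_step(params, grads, accum)
```

After the loop, the effective rate of the frequent dimension has shrunk more than that of the rare one, because its accumulated squared gradient is ten times larger.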
Speeding up Deep Neural Networks in Speech Recognition with Piecewise Quantized Sigmoidal Activation Function
Anhao XING Qingwei ZHAO Yonghong YAN

LETTER-Acoustic modeling

Publicized: 2016/07/19
Vol: E99-D No: 10
Page(s): 2558-2561
This paper proposes a new quantization framework for the activation function of deep neural networks (DNNs). We implement a fixed-point DNN by quantizing the activations into powers-of-two integers. The costly multiplication operations in DNN inference can be replaced with low-cost bit-shifts to massively save computation. Thus, applying DNN-based speech recognition on embedded systems becomes much easier. Experiments show that the proposed method leads to no performance degradation.
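The replacement of multiplications by bit-shifts can be sketched as follows, assuming a sigmoid-style activation in (0, 1] that is rounded to the nearest power of two in the log domain and a weight already stored as a fixed-point integer. The function names and the `max_shift` limit are hypothetical; the paper's actual piecewise quantization rule is not given in the abstract.

```python
import math


def quantize_to_power_of_two(activation, max_shift=7):
    """Return shift k so that 2**-k approximates the activation (log-domain rounding)."""
    if activation <= 0.0:
        return max_shift                 # treat non-positive values as the smallest level
    k = round(-math.log2(activation))
    return min(max(k, 0), max_shift)     # clamp to the representable range


def shift_multiply(weight_fixed, shift):
    """Compute weight * 2**-shift with a right shift instead of a multiplication."""
    return weight_fixed >> shift


# Hypothetical usage: a fixed-point weight of 96 times an activation near 0.25.
k = quantize_to_power_of_two(0.3)
product = shift_multiply(96, k)
```

Storing only the shift amount `k` per activation also keeps the quantized value compact, which is part of why this kind of scheme suits embedded targets.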
Short Text Classification Based on Distributional Representations of Words
Chenglong MA Qingwei ZHAO Jielin PAN Yonghong YAN

LETTER-Text classification

Publicized: 2016/07/19
Vol: E99-D No: 10
Page(s): 2562-2565
Short texts usually encounter the problem of data sparseness, as they do not provide sufficient term co-occurrence information. In this paper, we show how to mitigate the problem in short text classification through word embeddings. We assume that a short text document is a specific sample of one distribution in a Gaussian-Bayesian framework. Furthermore, a fast clustering algorithm is utilized to expand and enrich the context of short text in embedding space. This approach is compared with those based on the classical bag-of-words approaches and neural network based methods. Experimental results validate the effectiveness of the proposed method.
Improve Multichannel Speech Recognition with Temporal and Spatial Information
Yu ZHANG Pengyuan ZHANG Qingwei ZHAO

LETTER-Speech and Hearing

Publicized: 2018/04/06
Vol: E101-D No: 7
Page(s): 1963-1967
In this letter, we explore the use of spatio-temporal information in one unified framework to improve the performance of multichannel speech recognition. Generalized cross correlation (GCC) serves as spatial feature compensation, and an attention mechanism across time is embedded within long short-term memory (LSTM) neural networks. Experiments on the AMI meeting corpus show that the proposed method provides an 8.2% relative improvement in word error rate (WER) over the model trained directly on the concatenation of multiple microphone outputs.
Security Consideration for Deep Learning-Based Image Forensics
Wei ZHAO Pengpeng YANG Rongrong NI Yao ZHAO Haorui WU

LETTER-Image Recognition, Computer Vision

Publicized: 2018/08/24
Vol: E101-D No: 12
Page(s): 3263-3266
Recently, the image forensics community has paid attention to the design of effective algorithms based on deep learning techniques, and it has been shown that combining the domain knowledge of image forensics with deep learning achieves more robust and better performance than traditional schemes. Instead of improving algorithm performance, this paper considers the security of deep learning based methods in the field of image forensics. To the best of our knowledge, this is the first work focusing on this topic. Specifically, we experimentally find that methods using deep learning fail when slight noise is added to the images (adversarial images). Furthermore, two kinds of strategies are proposed to enforce the security of deep learning-based methods. First, a penalty term is added to the loss function, namely the 2-norm of the gradient of the loss with respect to the input images; then a novel training method is adopted to train the model by fusing the normal and adversarial images. Experimental results show that the proposed algorithm can achieve good performance even in the case of adversarial images and provide a security consideration for deep learning-based image forensics.
